AITopics | optimistic policy iteration

2dace78f80bc92e6d7493423d729448e-Reviews.html

Neural Information Processing Systems, Oct-3-2025, 08:13:42 GMT

First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. The paper presents a slight modification of the natural actor-critic (NAC) algorithm, called forgetful NAC, of which the original algorithm is a special case. The authors show that forgetful NAC and optimistic policy iteration are equivalent. The authors also present a non-optimality result for the soft-greedy Gibbs distribution, i.e., the optimal solution is not a fixed point of the policy iteration algorithm. I liked the unified view of both types of algorithms.
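To make the soft-greedy Gibbs distribution mentioned in the review concrete, the following is a minimal sketch, assuming a tabular vector of action values for a single state and a temperature parameter `tau` (both names and the numbers are illustrative, not taken from the paper). It only shows that, at any finite temperature, such a policy keeps nonzero probability on suboptimal actions; it does not reproduce the paper's non-optimality argument.

```python
import math


def gibbs_policy(q_values, tau):
    """Soft-greedy (Gibbs/Boltzmann) action probabilities for one state.

    q_values: list of action values (hypothetical numbers for illustration).
    tau: temperature > 0; a smaller tau gives a policy closer to greedy.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


q = [1.0, 0.5, 0.0]  # illustrative action values, not from the paper
for tau in (1.0, 0.1):
    probs = gibbs_policy(q, tau)
    print(tau, [round(p, 4) for p in probs])
```

Lowering `tau` moves probability mass toward the greedy action, but the suboptimal actions never reach exactly zero probability for any positive temperature.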

algorithm, iteration, policy iteration, (12 more...)

Neural Information Processing Systems

Country: North America > United States > Nevada (0.05)

Genre:

Summary/Review (0.48)
Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.35)

Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Neural Information Processing Systems, Sep-30-2025, 11:22:57 GMT

Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD($\lambda$)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward any optimal policy, except in a certain pathological case. Consequently, in the context of approximations, the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality.
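As a rough illustration of the policy-iteration side of this abstract, here is a minimal tabular sketch. It assumes the common reading of optimistic policy iteration as greedy policy improvement performed before policy evaluation has converged. The MDP, the function name, and the `eval_sweeps` parameter are hypothetical; the paper's setting (function approximation, soft-greedy policies, and the interpolation toward natural actor-critic) is not reproduced here.

```python
def optimistic_policy_iteration(P, R, gamma, eval_sweeps=1, iterations=200):
    """Tabular sketch: a few evaluation sweeps, then greedy improvement.

    P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
    expected immediate reward. Both structures are hypothetical.
    """
    n_states = len(P)
    V = [0.0] * n_states
    policy = [0] * n_states

    def q_value(s, a):
        return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])

    for _ in range(iterations):
        # Partial ("optimistic") evaluation of the current policy.
        for _ in range(eval_sweeps):
            V = [q_value(s, policy[s]) for s in range(n_states)]
        # Greedy improvement against the still-inexact value estimate.
        policy = [
            max(range(len(P[s])), key=lambda a: q_value(s, a))
            for s in range(n_states)
        ]
    return policy, V


# Hypothetical two-state MDP with two actions per state.
P = [
    [[(1.0, 0)], [(1.0, 1)]],  # state 0: stay, or move to state 1
    [[(1.0, 1)], [(1.0, 0)]],  # state 1: stay, or move to state 0
]
R = [
    [0.0, 0.0],
    [1.0, 0.0],  # only staying in state 1 is rewarded
]
policy, V = optimistic_policy_iteration(P, R, gamma=0.9)
print(policy, [round(v, 3) for v in V])
```

In this exact tabular case the scheme settles on a single policy. The oscillation and chattering discussed in the abstract concern the approximate setting, which this sketch deliberately leaves out.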

greedy value function method, optimistic policy iteration, policy iteration and natural actor-critic, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.60)

Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Neural Information Processing Systems, Mar-13-2024, 15:36:20 GMT

Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD(λ)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward an optimal policy, except in a certain pathological case. Consequently, in the context of approximations (either in state estimation or in value function representation), the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality.

iteration, optimistic policy iteration, policy iteration, (15 more...)

Neural Information Processing Systems

Country:

Europe > Finland (0.04)
Asia > Middle East > Jordan (0.04)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

A Distributional Analysis of Sampling-Based Reinforcement Learning Algorithms

Amortila, Philip, Precup, Doina, Panangaden, Prakash, Bellemare, Marc G.

arXiv.org Artificial Intelligence, Mar-27-2020

We present a distributional approach to theoretical analyses of reinforcement learning algorithms for constant step-sizes. We demonstrate its effectiveness by presenting simple and unified proofs of convergence for a variety of commonly-used methods. We show that value-based methods such as TD($\lambda$) and $Q$-Learning have update rules which are contractive in the space of distributions of functions, thus establishing their exponentially fast convergence to a stationary distribution. We demonstrate that the stationary distribution obtained by any algorithm whose target is an expected Bellman update has a mean which is equal to the true value function. Furthermore, we establish that the distributions concentrate around their mean as the step-size shrinks. We further analyse the optimistic policy iteration algorithm, for which the contraction property does not hold, and formulate a probabilistic policy improvement property which entails the convergence of the algorithm.
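The claims about the stationary distribution can be illustrated in a deliberately trivial setting. The sketch below assumes a single state with no bootstrapping, where the estimate follows `V <- (1 - alpha) * V + alpha * G` and `G` is an i.i.d. noisy target. This setting, the function name, and the parameter names are assumptions for illustration and are not the paper's construction. Because the update is linear, the mean and variance of `V` obey exact recursions, so no sampling is needed.

```python
def stationary_moments(alpha, target_mean, target_var, steps=10_000):
    """Mean and variance of V under V <- (1 - alpha) * V + alpha * G.

    G is an i.i.d. noisy target with the given mean and variance (a
    hypothetical single-state, no-bootstrapping setting). Because the update
    is linear and G is independent of V, the first two moments follow exact
    recursions.
    """
    mean, var = 0.0, 0.0  # start from the deterministic estimate V = 0
    for _ in range(steps):
        mean = (1 - alpha) * mean + alpha * target_mean
        var = (1 - alpha) ** 2 * var + alpha ** 2 * target_var
    return mean, var


for alpha in (0.5, 0.1, 0.01):
    mean, var = stationary_moments(alpha, target_mean=2.0, target_var=1.0)
    print(alpha, round(mean, 6), round(var, 6))
```

In this toy case the limiting mean equals the target mean for every step-size, while the limiting variance is `alpha * target_var / (2 - alpha)`, which shrinks as `alpha` shrinks.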

algorithm, convergence, operator, (14 more...)

arXiv.org Artificial Intelligence

2003.12239

Country:

North America > United States > Massachusetts > Middlesex County > Belmont (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Italy > Sicily > Palermo (0.04)

Genre: Research Report (0.81)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Wagner, Paul

Neural Information Processing Systems, Feb-14-2020, 17:26:58 GMT

Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD($\lambda$)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward any optimal policy, except in a certain pathological case. Consequently, in the context of approximations, the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality.

greedy value function method, optimistic policy iteration, policy iteration and natural actor-critic, (4 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.63)

Optimistic policy iteration and natural actor-critic: A unifying view and a non-optimality result

Wagner, Paul

Neural Information Processing Systems, Dec-31-2013

Approximate dynamic programming approaches to the reinforcement learning problem are often categorized into greedy value function methods and value-based policy gradient methods. As our first main result, we show that an important subset of the latter methodology is, in fact, a limiting special case of a general formulation of the former methodology; optimistic policy iteration encompasses not only most of the greedy value function methods but also natural actor-critic methods, and permits one to directly interpolate between them. The resulting continuum adjusts the strength of the Markov assumption in policy improvement and, as such, can be seen as dual in spirit to the continuum in TD($\lambda$)-style algorithms in policy evaluation. As our second main result, we show for a substantial subset of soft-greedy value function approaches that, while having the potential to avoid policy oscillation and policy chattering, this subset can never converge toward any optimal policy, except in a certain pathological case. Consequently, in the context of approximations, the majority of greedy value function methods seem to be deemed to suffer either from the risk of oscillation/chattering or from the presence of systematic sub-optimality.

convergence, iteration, policy iteration, (15 more...)

Neural Information Processing Systems

Country: Europe > Finland (0.04)

Industry: Education (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

A Unified View of TD Algorithms; Introducing Full-Gradient TD and Equi-Gradient Descent TD

Loth, Manuel, Preux, Philippe

arXiv.org Artificial Intelligence, Dec-1-2009

This paper addresses the issue of policy evaluation in Markov Decision Processes, using linear function approximation. It provides a unified view of algorithms such as TD(lambda), LSTD(lambda), iLSTD, residual-gradient TD. It is asserted that they all consist in minimizing a gradient function and differ by the form of this function and their means of minimizing it. Two new schemes are introduced in that framework: Full-gradient TD which uses a generalization of the principle introduced in iLSTD, and EGD TD, which reduces the gradient by successive equi-gradient descents. These three algorithms form a new intermediate family with the interesting property of making much better use of the samples than TD while keeping a gradient descent scheme, which is useful for complexity issues and optimistic policy iteration.
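For readers unfamiliar with the baseline that this abstract compares against, the following is a minimal sketch of standard TD(lambda) policy evaluation with accumulating eligibility traces and a linear value estimate. It does not implement the Full-gradient TD or EGD TD schemes the paper introduces. The two-state chain, the one-hot features, and all names are hypothetical choices made for illustration.

```python
def td_lambda(transitions, features, gamma, lam, alpha, sweeps):
    """TD(lambda) with accumulating traces and a linear value estimate.

    transitions: list of (state, reward, next_state) tuples visited in order.
    features: dict mapping each state to its feature vector (a list).
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim

    def value(s):
        return sum(wi * xi for wi, xi in zip(w, features[s]))

    for _ in range(sweeps):
        z = [0.0] * dim  # eligibility trace, reset at the start of each pass
        for s, r, s2 in transitions:
            # Temporal-difference error under the current weights.
            delta = r + gamma * value(s2) - value(s)
            # Decay the trace, then add the features of the visited state.
            z = [gamma * lam * zi + xi for zi, xi in zip(z, features[s])]
            w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w


# Hypothetical deterministic cycle: 0 -> 1 (reward 0), 1 -> 0 (reward 1).
transitions = [(0, 0.0, 1), (1, 1.0, 0)]
features = {0: [1.0, 0.0], 1: [0.0, 1.0]}  # one-hot (tabular) features
w = td_lambda(transitions, features, gamma=0.5, lam=0.5, alpha=0.1, sweeps=2000)
print([round(wi, 4) for wi in w])
```

With one-hot features the weights are the state values themselves, so the result can be checked against the Bellman equations of the chain, which give 2/3 for state 0 and 4/3 for state 1 at `gamma=0.5`.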

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Artificial Intelligence

cs/0611145

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
Europe > France (0.14)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)
